Take-home_Ex03

Author

Huynh Minh Phuong

Getting started

FishEye International, a non-profit organization dedicated to combatting the scourge of illegal, unreported and unregulated (IUU) fishing, has been granted access to fishing-related companies’ financial database, offered by an international finance corporation. Through their previous investigations, FishEye has discovered that companies with unusual arrangements are more likely to be engaged in IUU activities or other questionable practices. To leverage this valuable resource, FishEye has transformed the database into a comprehensive knowledge graph, which encompasses data on companies, owners, workers and revenue.

The primary objective of our project is to employ this graph to detect irregularities that may indicate a company’s involvement in IUU fishing.

Dataset contains the knowledge graph with 27,622 nodes and 24,038 edges:

Node Attributes:

  • type – Type of node as defined above.

  • country – Country associated with the entity. This can be a full country or a two-letter country code.

  • product_services – Description of product services that the “id” node does.

  • revenue_omu – Operating revenue of the “id” node in Oceanus Monetary Units.

  • id – Identifier of the node is also the name of the entry.

  • role – The subset of the “type” node, not in every node attribute.

Edge Attributes:

  • type – Type of the edge as defined above.

  • source – ID of the source node.

  • target – ID of the target node.

  • dataset – Always “MC3”.

  • role - The subset of the “type” node, not in every edge attribute.

Load packages

Show code
pacman::p_load(jsonlite, tidygraph, ggraph,igraph,lsa, 
               visNetwork, graphlayouts, ggforce, 
               skimr, tidytext, tidyverse, plotly, naniar,
               tm, topicmodels, ldatuning, ggridges,
               ggstatsplot, text2vec)

Load data set

Show code
mc3_data <-fromJSON("data/MC3.json")

Extracting edges

We first extract links data into a tibble using as_tibble()

Show code
mc3_edges <-as_tibble(mc3_data$links)
glimpse(mc3_edges)
Rows: 24,038
Columns: 3
$ source <list> "Lake Chad  Catchers Limited Liability Company Worldwide", "La…
$ target <list> "Erin Flores", "Linda Lee", "Sharon Coleman", "John Rivera", "…
$ type   <list> "Beneficial Owner", "Beneficial Owner", "Beneficial Owner", "B…

Next, we will wrangle the data for edges:

  • distinct() is used to ensure that there will be no duplicated records.

  • mutate() and as.character() are used to convert the field data type from list to character.

  • group_by() and summarise() are used to count the number of unique links.

  • the filter(source!=target) is to ensure that no record with similar source and target.

Show code
mc3_edges <-mc3_edges %>% 
  distinct() %>%
  mutate(source=as.character(source),
         target=as.character(target),
         type=as.character(type)) %>% 
  group_by(source, target, type) %>% 
  summarise(weights=n()) %>% 
  filter(source!=target) %>% 
  ungroup()

Extracting nodes

We use as_tibble() to extract node data into a tibble.

Show code
mc3_nodes <-as_tibble(mc3_data$nodes)

We wrangle the node data:

  • mutate() and as.character() are used to convert the field data type from list to character.

  • To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.

  • select() is used to re-organise the order of the fields.

Show code
mc3_nodes<-mc3_nodes %>% 
  mutate(country=as.character(country),
         id=as.character(id),
         product_services=as.character(product_services),
         revenue_omu=as.numeric(as.character(revenue_omu)),
         type=as.character(type)) %>% 
  select(id, country, type, revenue_omu, product_services)

Initial Data Exploration

Edge data

In the code chunk below, skim() of skimr package is used to display the summary statistics of mc3_edges tibble data frame.

Show code
skim(mc3_edges)
Data summary
Name mc3_edges
Number of rows 24036
Number of columns 4
_______________________
Column type frequency:
character 3
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
source 0 1 6 700 0 12856 0
target 0 1 6 28 0 21265 0
type 0 1 16 16 0 2 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
weights 0 1 1 0 1 1 1 1 1 ▁▁▇▁▁

The report above reveals that there is no missing values in all fields.

datatable() of DT package is used to display mc3_edges tibble data frame as an interactive table on the html document.

Show code
DT::datatable(mc3_edges)

We use ggplot to plot the frequency of different types of relationship in the edges. We have two types: beneficial owner and company contacts.

Show code
type_freq<-ggplot(mc3_edges, aes(x=type))+
  geom_bar()+
  theme_minimal()+
  ggtitle('Frequency Count by Relationship Type')

ggplotly(type_freq)  

We take a look at the number of companies under each owner. Most owners have only 1 company. We will extract the graph for those owners with more than 3 companies under them. There are 67 such owners. We will look at the network of these owners.

Show code
owner_companycnt<-mc3_edges %>% 
  filter(type=='Beneficial Owner') %>% 
  group_by(target) %>% 
  summarise(company_count=n()) %>% 
  arrange(desc(company_count)) %>% 
  ggplot(aes(x=company_count)) +
  geom_bar()+
  scale_x_continuous(breaks=c(1:10))+
  theme_minimal()+
  ggtitle('Frequency Count by Number of Companies under each owner')
ggplotly(owner_companycnt)
Show code
# Extract onwers with more than 3 companies
owners<-mc3_edges %>% 
  filter(type=='Beneficial Owner') %>% 
  group_by(target) %>% 
  summarise(company_count=n()) %>% 
  filter(company_count>3)

Node data

Use skim() of skimr package is used to display the summary statistics of mc3_nodes tibble data frame.

Show code
skim(mc3_nodes)
Data summary
Name mc3_nodes
Number of rows 27622
Number of columns 5
_______________________
Column type frequency:
character 4
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
id 0 1 6 64 0 22929 0
country 0 1 2 15 0 100 0
type 0 1 7 16 0 3 0
product_services 0 1 4 1737 0 3244 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
revenue_omu 21515 0.22 1822155 18184433 3652.23 7676.36 16210.68 48327.66 310612303 ▇▁▁▁▁

There are missing values in revenue_omu but no missing values in other variables: id, country, type, product_services

datatable() of DT package is used to display mc3_nodes tibble data frame as an interactive table on the html document.

Show code
DT::datatable(mc3_nodes)

Plot the frequency of type of nodes using geom_bar()

Show code
ggplot(data = mc3_nodes,
       aes(x = type)) +
  geom_bar()+
  theme_minimal()

Show code
cp_plot<-mc3_nodes %>% 
  filter(type=='Company') %>% 
  group_by(id) %>% 
  mutate(country_count=n_distinct(country)) %>%
  arrange(desc(country_count)) %>% 
  ggplot(aes(x=country_count))+
  geom_bar()+
  scale_x_continuous(breaks=c(1:10))+
  theme_minimal()+
  ggtitle('Frequency Count by Number of Countries of each company')

ggplotly(cp_plot)

We will save companies that operate in multiple countries in cp_multicount

Show code
cp_multicount<-mc3_nodes %>% 
  filter(type=='Company') %>% 
  group_by(id) %>% 
  summarise(entry_cnt=n()) %>%
  filter(entry_cnt>1) %>% 
  ungroup()

Let’s also take a look at the distribution of different countries for the companies.

Show code
mc3_nodes %>%
  filter(type=='Company') %>%
  select(id, country) %>% 
  distinct() %>%
  group_by(country) %>% 
  mutate(cnt=n()) %>% 
  ggplot(aes(x=reorder(country, cnt)))+
  geom_bar()+
  theme_minimal()+
  ggtitle('Frequency Count by Countries of Company')+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))

Identify shell companies from ownership network graph

Owners with high number of companies

Let’s take a look at the network of owners with high number of companies under them. We want to check if these companies under these owners are also co-owned by other owners. If a owner owns many companies that are not co-owned by others, these are likely shell companies.

Show code
# Extract company information 
cp_oh<-mc3_edges %>%
  filter(type=='Beneficial Owner') %>% 
  filter(target %in% owners$target) %>% 
  select(source)

# Extract edge information
oh_edges <- mc3_edges %>%
  filter(type=='Beneficial Owner') %>% 
  filter(target %in% owners$target | source %in% cp_oh)

# Extract node information  
oh_id1<-oh_edges %>% 
  select(source) %>%
  rename(id = source) %>% 
  mutate(type='Company') 

oh_id2 <- oh_edges %>%
  select(target, type) %>%
  rename(id = target) 

oh_nodes <- rbind(oh_id1, oh_id2) %>% 
  distinct()

# Create owner_graph
oh_graph <- as_tbl_graph(oh_edges, directed = FALSE)

oh_graph<-oh_graph %>% 
  activate(nodes) %>% 
  left_join(oh_nodes, by=c("name"="id")) %>% 
  mutate(betweenness_centrality=centrality_betweenness()) %>%
  mutate(degree_centrality=centrality_degree())

oh_graph
# A tbl_graph: 362 nodes and 313 edges
#
# An unrooted forest with 49 trees
#
# A tibble: 362 × 4
  name                          type    betweenness_centrality degree_centrality
  <chr>                         <chr>                    <dbl>             <dbl>
1 Acevedo, Dickson and Gonzalez Company                      0                 1
2 Adams Group                   Company                      0                 1
3 Adams-Pope                    Company                      0                 1
4 Adriatic Catch S.A. de C.V.   Company                      0                 1
5 Albertine Rift  NV Family     Company                      0                 1
6 Alexander PLC                 Company                      0                 1
# ℹ 356 more rows
#
# A tibble: 313 × 4
   from    to type             weights
  <int> <int> <chr>              <int>
1     1   296 Beneficial Owner       1
2     2   297 Beneficial Owner       1
3     3   298 Beneficial Owner       1
# ℹ 310 more rows
Show code
oh_graph %>%
  ggraph(layout = "fr") +
  geom_edge_link() +
  geom_node_point(aes(colour=type,
                      alpha=0.5,
                      size=betweenness_centrality)) +
  scale_size_continuous(range=c(1,10))+
  theme_graph()

We can create an interactive graph to get the names of the companies and beneficial owners by hovering over the nodes.

Show code
edges_df<-oh_graph%>%
  activate(edges) %>% 
  as.tibble() 

nodes_df<-oh_graph%>%
  activate(nodes) %>% 
  as.tibble() %>%
  rename(label=name) %>%
  rename(group=type) %>% 
  mutate(id=row_number()) 

visNetwork(nodes=nodes_df,edges=edges_df)%>%    
  visIgraphLayout(layout = "layout_with_fr") %>%  
  visLegend() %>%
  visEdges(arrows = "to", 
           smooth = list(enabled = TRUE,              
                         type = "curvedCW")) %>%
  visNodes(font = list(size=30)) %>% 
  visLayout(randomSeed=123) %>%    
  visOptions(highlightNearest = TRUE, 
             nodesIdSelection = TRUE)

Group companies by industries based on analysis of product and services

Topic modelling to identify industries

We first needs to tokenize the words from products and services description and filter companies from node data.

Show code
token_nodes <- mc3_nodes %>%
  filter(type=='Company') %>% 
  unnest_tokens(word, 
                product_services) 
token_nodes
# A tibble: 64,202 × 5
   id                                   country type    revenue_omu word       
   <chr>                                <chr>   <chr>         <dbl> <chr>      
 1 Jones LLC                            ZH      Company  310612303. automobiles
 2 Coleman, Hall and Lopez              ZH      Company  162734684. passenger  
 3 Coleman, Hall and Lopez              ZH      Company  162734684. cars       
 4 Coleman, Hall and Lopez              ZH      Company  162734684. trucks     
 5 Coleman, Hall and Lopez              ZH      Company  162734684. vans       
 6 Coleman, Hall and Lopez              ZH      Company  162734684. and        
 7 Coleman, Hall and Lopez              ZH      Company  162734684. buses      
 8 Aqua Advancements Sashimi SE Express Oceanus Company  115004667. holding    
 9 Aqua Advancements Sashimi SE Express Oceanus Company  115004667. firm       
10 Aqua Advancements Sashimi SE Express Oceanus Company  115004667. whose      
# ℹ 64,192 more rows

We perform a frequency count for the words.

Show code
token_nodes %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "Count",
      y = "Unique words",
      title = "Count of unique words found in product_services field")

The bar chart reveals that the unique words contains some words that may not be useful to use. For instance “a” and “to”. In the word of text mining we call those words stop words. You want to remove these words from your analysis as they are fillers used to compose a sentence. We also will filter common words such as unknown and products, which will not be helpful in distinguishing the companies.

Show code
stopwords_removed <- token_nodes %>%
  filter(!word %in% c('unknown', 'products','related')) %>% 
  anti_join(stop_words)
Show code
stopwords_removed %>%
  count(word, sort = TRUE) %>%
  top_n(15) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(x = word, y = n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip() +
      labs(x = "Count",
      y = "Unique words",
      title = "Count of unique words found in product_services field")

Following the completion of cleaning and visualization, the subsequent stage involves conducting Latent Dirichlet Allocation (LDA). LDA is an iterative algorithm utilized to reveal topics by examining discrete word frequencies. The underlying notion behind LDA is that documents typically pertain to a limited number of topics, and these topics are generally based on a small set of words. However, prior to that, it is necessary to construct a Document Term Matrix. This matrix mathematically represents the occurrence frequency of terms within a collection of documents. In the document-term matrix, each row corresponds to a document in the collection, while each column corresponds to a term.

Show code
dtm<- stopwords_removed %>%
  count(id, word) %>% 
  cast_dtm(id, word, n) %>%  
  as.matrix()

We use ldatuning to select the number of topic for LDA model.

Show code
result <- FindTopicsNumber(
  dtm,
  topics = seq(from = 2, to = 20, by = 1),
  metrics = c("Griffiths2004", "CaoJuan2009", "Arun2010", "Deveaud2014"),
  method = "Gibbs",
  control = list(seed = 77),
  mc.cores = 2L,
  verbose = TRUE
)
fit models... done.
calculate metrics:
  Griffiths2004... done.
  CaoJuan2009... done.
  Arun2010... done.
  Deveaud2014... done.

One straightforward method for analyzing metrics involves identifying extrema. For a more comprehensive understanding, please refer to the relevant papers:

Minimization:

  • Arun2010 [1]

  • CaoJuan2009 [2]

Maximization:

  • Deveaud2014 [3]

  • Griffiths2004 [4,5]

To facilitate easy analysis of the outcomes, the FindTopicsNumber_plot support function can be utilized.

Show code
FindTopicsNumber_plot(result)

From the graph, it appears that 7 topics are optimal.

Show code
lda_topics <- LDA(
  dtm,
  k = 7,
  method = "Gibbs",
  control = list(seed=42)
  ) %>%
  tidy(matrix = "beta")
word_probs <- lda_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
ggplot(
  word_probs,
  aes(term2, beta, fill=as.factor(topic))
  ) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

We can see that topic 5 overlap with topic 3 and topic 4. We can try to reduce the number of topics to 5.

Show code
lda_topics <- LDA(
  dtm,
  k = 5,
  method = "Gibbs",
  control = list(seed=42)
  ) %>%
  tidy(matrix = "beta")
word_probs <- lda_topics %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  mutate(term2 = fct_reorder(term, beta))
ggplot(
  word_probs,
  aes(term2, beta, fill=as.factor(topic))
  ) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip()

We can see that better distinction in the topics:

  • topic 1: accessories and materials

  • topic 2: food

  • topic 3: equipment

  • topic 4: fish and seafood

  • topic 5: services

Label the company by industry and observe patterns within industry

We will use tidy() to get the probability of the topic for each company. Then we will take a look at each industry in terms of their network and interactions and disregard companies with unknown products and services.

Show code
# Extract probability of the topic for each company
lda_topics <- LDA(
  dtm,
  k = 5,
  method = "Gibbs",
  control = list(seed=42)
  ) %>% 
  tidy(matrix = "gamma")

# Summarize company revenue
company<-mc3_nodes %>%
  filter(type=='Company') %>% 
  group_by(id) %>% 
  summarise(revenue=sum(revenue_omu))
Show code
# Assign topic with the highest gamma score to the document/company
cp_map<-lda_topics %>% 
  group_by(document) %>% 
  summarise(gamma=max(gamma)) 

cp_map<-cp_map %>% 
  left_join(lda_topics) %>% 
  mutate(topic=recode(topic, '1'="accessories_materials",
                      '2'="food",
                      '3'="equipment",
                      '4'="fish_seafood",
                      '5'="services")) %>%
  rename("Industry"="topic") %>% 
  select(document, Industry)

# Use left join to join back to company revenue info
company<-company %>%
  left_join(cp_map, by=c("id"="document"))

# Drop those company with unknown industry
company<- company %>% 
  drop_na(Industry) %>% 
  drop_na(revenue)

One company can be in a few industries at the same time.

Identify the patterns within each industry

Revenue pattern

We can check the distribution of revenue pattern. We need to remove the extreme outliers by using IQR.

Show code
company %>% 
  ggplot(aes(x=Industry, y=revenue))+
  geom_boxplot()+
  theme_minimal()

Show code
company_rmo<-company %>% 
  group_by(Industry) %>% 
  mutate(revenue_cleaned=ifelse(revenue>quantile(revenue, 0.75, na.rm=T)+1.5*IQR(revenue, na.rm=T) | revenue<quantile(revenue, 0.25, na.rm=T), NA, revenue)) %>% 
  na.omit()

We observed right skewed distribution of revenue in all industries.

Show code
company_rmo %>% 
  ggplot(aes(x = revenue, 
           y = Industry,
           fill=factor(after_stat(quantile))
           )) +
stat_density_ridges(
    geom = "density_ridges_gradient",
    calc_ecdf = TRUE, 
    quantiles = 4,
    quantile_lines = TRUE) +
  scale_fill_viridis_d(name = "Quartiles") +
  theme_ridges()+
  ggtitle("Distribution of Revenue across different Industries")+
  theme_minimal()+
  ylab("Industries")+
  xlab("Revenue")

We performed statistical analysis to find out if the mean revenue is different for the different industries. From the results, we can see that equipment and food industry have the highest mean revenue. Fishing, accessories/materials and services have lowest mean revenue.

Show code
ggbetweenstats(
  data = company_rmo,
  x = Industry, 
  y = revenue,
  type = "np",
  mean.ci = TRUE, 
  pairwise.comparisons = TRUE, 
  pairwise.display = "s",
  p.adjust.method = "fdr",
  messages = FALSE
) + 
  ggtitle("Compare mean revenue among Industries")+
  xlab("Industries")